This notebook presents an analysis of data on 17007 strategy games available on the Apple App Store, such as Clash of Clans, Plants vs Zombies, Pokemon GO and others. The dataset was acquired from Kaggle.com, and it was collected on the 3rd of August 2019 using the iTunes API.
With this dataset, we may be able to analyze which factors make a successful game.
To start this analysis, we first load the required packages (tidyverse, readr, DT) and read the csv file provided by Kaggle.
if(!require(tidyverse)){install.packages("tidyverse")}
if(!require(readr)){install.packages("readr")}
if(!require(DT)){install.packages("DT")}
options(scipen=10000)
appstoreGamesFile = "data/appstore_games.csv"
appstoreGamesDF = read_csv(appstoreGamesFile) %>% rename_all(~str_replace_all(., "\\s+", ""))
summary(appstoreGamesDF)
## URL ID Name Subtitle
## Length:17007 Min. : 284921427 Length:17007 Length:17007
## Class :character 1st Qu.: 899654330 Class :character Class :character
## Mode :character Median :1112286228 Mode :character Mode :character
## Mean :1059613815
## 3rd Qu.:1286982837
## Max. :1475076711
##
## IconURL AverageUserRating UserRatingCount Price
## Length:17007 Min. :1.000 Min. : 5 Min. : 0.0000
## Class :character 1st Qu.:3.500 1st Qu.: 12 1st Qu.: 0.0000
## Mode :character Median :4.500 Median : 46 Median : 0.0000
## Mean :4.061 Mean : 3306 Mean : 0.8134
## 3rd Qu.:4.500 3rd Qu.: 309 3rd Qu.: 0.0000
## Max. :5.000 Max. :3032734 Max. :179.9900
## NA's :9446 NA's :9446 NA's :24
## In-appPurchases Description Developer AgeRating
## Length:17007 Length:17007 Length:17007 Length:17007
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Languages Size PrimaryGenre Genres
## Length:17007 Min. : 51328 Length:17007 Length:17007
## Class :character 1st Qu.: 22950144 Class :character Class :character
## Mode :character Median : 56768954 Mode :character Mode :character
## Mean : 115706430
## 3rd Qu.: 133027072
## Max. :4005591040
## NA's :1
## OriginalReleaseDate CurrentVersionReleaseDate
## Length:17007 Length:17007
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
As the summary shows, there are 18 columns in this dataset.
We need to fix the typing of some columns, such as the release dates.
fixedAppstoreGamesDF <- appstoreGamesDF %>%
mutate(OriginalReleaseDate = as.Date(OriginalReleaseDate, "%d/%m/%Y")) %>%
mutate(CurrentVersionReleaseDate = as.Date(CurrentVersionReleaseDate, "%d/%m/%Y")) %>%
mutate(AgeRating = factor(AgeRating, levels=c('4+','9+', '12+', '17+')))
appstoreGamesDF <- fixedAppstoreGamesDF
datatable(appstoreGamesDF %>% select(-URL, -ID, -Subtitle, -IconURL, -Description, -Developer))
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
summary(appstoreGamesDF)
## URL ID Name Subtitle
## Length:17007 Min. : 284921427 Length:17007 Length:17007
## Class :character 1st Qu.: 899654330 Class :character Class :character
## Mode :character Median :1112286228 Mode :character Mode :character
## Mean :1059613815
## 3rd Qu.:1286982837
## Max. :1475076711
##
## IconURL AverageUserRating UserRatingCount Price
## Length:17007 Min. :1.000 Min. : 5 Min. : 0.0000
## Class :character 1st Qu.:3.500 1st Qu.: 12 1st Qu.: 0.0000
## Mode :character Median :4.500 Median : 46 Median : 0.0000
## Mean :4.061 Mean : 3306 Mean : 0.8134
## 3rd Qu.:4.500 3rd Qu.: 309 3rd Qu.: 0.0000
## Max. :5.000 Max. :3032734 Max. :179.9900
## NA's :9446 NA's :9446 NA's :24
## In-appPurchases Description Developer AgeRating
## Length:17007 Length:17007 Length:17007 4+ :11806
## Class :character Class :character Class :character 9+ : 2481
## Mode :character Mode :character Mode :character 12+: 2055
## 17+: 665
##
##
##
## Languages Size PrimaryGenre Genres
## Length:17007 Min. : 51328 Length:17007 Length:17007
## Class :character 1st Qu.: 22950144 Class :character Class :character
## Mode :character Median : 56768954 Mode :character Mode :character
## Mean : 115706430
## 3rd Qu.: 133027072
## Max. :4005591040
## NA's :1
## OriginalReleaseDate CurrentVersionReleaseDate
## Min. :2008-07-11 Min. :2008-08-01
## 1st Qu.:2014-09-23 1st Qu.:2016-04-17
## Median :2016-07-09 Median :2017-07-24
## Mean :2016-03-04 Mean :2017-04-26
## 3rd Qu.:2017-12-07 3rd Qu.:2018-11-19
## Max. :2019-10-26 Max. :2019-10-26
##
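The summary above shows both release date columns parsed as dates. As a minimal sketch of how that conversion behaves, the snippet below parses a few made-up strings with the same `%d/%m/%Y` format. The sample values are assumptions for illustration and are not taken from the dataset.

```r
# Hypothetical sample values, assuming the CSV stores dates as day/month/year.
raw_dates <- c("11/07/2008", "26/10/2019", "not a date")

# With an explicit format, as.Date() returns NA for strings that do not match.
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")
print(parsed)

# Count the values that failed to parse so they can be inspected.
n_failed <- sum(is.na(parsed))
print(n_failed)
```

Counting the `NA` values after conversion is a cheap way to notice rows whose dates use a different format.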
Right now I have no hypotheses to check, but let's create some plots to see the current state of the games released on the App Store.
First, the number of games released each year. The plot shows that the number of games released increased every year up until 2016, while 2017 and 2018 had fewer releases. 2019 is not yet over, so it may still catch up to the previous years.
appstoreGamesDF %>%
select(OriginalReleaseDate) %>%
mutate(OriginalReleaseYear = format(OriginalReleaseDate, "%Y")) %>%
group_by(OriginalReleaseYear) %>%
summarise(count = n()) %>%
ggplot(aes(x=OriginalReleaseYear, y=count)) +
geom_col() +
geom_text(aes(label=count), vjust=-0.25) +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
ylab("Number of games released") +
xlab("Release Year") +
theme_minimal()
Plotting the release year of the current version of the games doesn’t give us much information. At best, we can see that the majority of the games have had an update in the last 4 years.
appstoreGamesDF %>%
select(CurrentVersionReleaseDate) %>%
mutate(CurrentVersionRelease = format(CurrentVersionReleaseDate, "%Y")) %>%
group_by(CurrentVersionRelease) %>%
summarise(count = n()) %>%
ggplot(aes(x=CurrentVersionRelease, y=count)) +
geom_col() +
geom_text(aes(label=count), vjust=-0.25) +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
theme_minimal()
## Number of games per user rating.
In this plot, we can see that the number of games per rating score increases in a curved fashion up until the 4.5 score. The perfect 5 score is much less common than 4 and 4.5.
unique(appstoreGamesDF$AverageUserRating)
## [1] 4.0 3.5 3.0 2.5 NA 2.0 4.5 1.5 5.0 1.0
appstoreGamesDF %>%
select(AverageUserRating) %>%
filter(!is.na(AverageUserRating)) %>%
group_by(AverageUserRating) %>%
summarise(count = n()) %>%
ggplot(aes(x=AverageUserRating, y=count)) +
geom_col() +
geom_text(aes(label=count), vjust=-0.25) +
scale_x_continuous(breaks = seq(1,5,by=0.5)) +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
theme_minimal()
This plot is simple: it shows that games on the App Store tend to target all ages. Adult games (17+) are rare.
appstoreGamesDF %>%
select(AgeRating) %>%
arrange(AgeRating) %>%
group_by(AgeRating) %>%
summarise(count = n()) %>%
ggplot(aes(x=AgeRating, y=count)) +
geom_col() +
geom_text(aes(label=count), vjust=-0.25) +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
theme_minimal()
## Number of Games per Language.
A game on the App Store may be localized in multiple languages. According to this plot, the two most popular languages for games are English (EN) and Chinese (ZH). This likely reflects the language proficiency of the userbase.
appstoreGamesDF %>%
select(ID, Languages) %>%
separate_rows(Languages, sep=",\\s*") %>%
drop_na(Languages) %>%
group_by(Languages) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
top_n(20) %>%
ggplot(aes(x=reorder(Languages,desc(count)), y=count)) +
geom_col() +
geom_text(aes(label=count), vjust=-0.25, size=3.5) +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
theme_minimal()
## Selecting by count
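Because each game stores all of its languages in a single string, the way that string is split affects the counts. The base R sketch below uses made-up rows and assumes the codes are separated by a comma followed by a space. It shows how splitting on the comma alone can leave a leading space that splits one language into two groups.

```r
# Hypothetical rows; the ", " separator is an assumption about the CSV.
languages <- c("EN, ZH, JA", "EN", "ZH, EN")

# Splitting on "," alone keeps the leading space, so "EN" and " EN" differ.
naive <- unlist(strsplit(languages, ","))

# Trimming whitespace merges the variants into one code per language.
trimmed <- trimws(naive)

print(table(naive))
print(table(trimmed))
```

The later plots in this notebook apply `trimws()` after `separate_rows()` for the same reason.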
Similar to languages, a game may have multiple genres. The two most common genres are “Strategy” and “Games”, which is natural since the dataset we are analyzing is about “Strategy Games”. Many games are also classified as “Entertainment”, which is not a game genre. The actual most popular game genre in this dataset is “Puzzle”.
appstoreGamesDF %>%
select(ID, Genres) %>%
separate_rows(Genres, sep=",\\s*") %>%
drop_na(Genres) %>%
group_by(Genres) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
top_n(20) %>%
ggplot(aes(x=reorder(Genres,desc(count)), y=count)) +
geom_col() +
geom_text(aes(label=count), vjust=-0.25, size=3.5) +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
theme_minimal() +
theme(axis.text.x = element_text(angle=90,vjust= 0.2,hjust=1))
## Selecting by count
## Number of free/not-free games
The majority of games on the App Store are free to play, as seen in this plot.
appstoreGamesDF %>%
select(Price) %>%
drop_na(Price) %>%
mutate(PriceRange = case_when(Price <= 0 ~ "Free",
TRUE ~ "Not Free"))%>%
mutate(PriceRange = factor(PriceRange, levels= c("Free", "Not Free"))) %>%
group_by(PriceRange) %>%
summarise(count = n()) %>%
ggplot(aes(x=PriceRange, y=count))+
geom_col()+
geom_text(aes(label=count), vjust=-0.25, size=3.5) +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
theme_minimal()
Finally, I also tried to plot a histogram of the number of games by number of user ratings. However, the histogram is severely unbalanced: most games have very few user ratings. For this reason, the plot below only includes games with at least 10,000 user ratings.
appstoreGamesDF %>%
select(UserRatingCount) %>%
filter(!is.na(UserRatingCount))%>%
filter(UserRatingCount>=10000)%>%
arrange(UserRatingCount) %>%
ggplot(aes(x=UserRatingCount)) +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
geom_histogram(bins=10) +
theme_minimal() +
labs(x="Total Number of User Ratings", y="Number of Games")
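An alternative to cutting the data at a threshold is to bin the counts by order of magnitude, which keeps the low-count games visible. The sketch below is not what this notebook does. It is a base R illustration on made-up counts, and the breaks chosen are an assumption.

```r
# Hypothetical rating counts spanning several orders of magnitude.
rating_counts <- c(5, 8, 12, 46, 90, 309, 1500, 25000, 3032734)

# Bin by order of magnitude: [1, 10), [10, 100), ..., [1e6, 1e7).
breaks <- 10^(0:7)
bins <- cut(rating_counts, breaks = breaks, right = FALSE, dig.lab = 8)

# Number of games in each bin.
print(table(bins))
```

In a ggplot2 pipeline, adding `scale_x_log10()` to the histogram has a similar effect on the x-axis.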
For the plot of the most common genres per age rating, I removed “Games” and “Entertainment” because they are not game genres. I also removed “Strategy” because the whole dataset is about games in this genre.
appstoreGamesDF %>%
select(ID,AgeRating, Genres) %>%
separate_rows(Genres, sep=",", convert = TRUE) %>%
mutate(Genres = trimws(Genres)) %>%
filter(Genres != "Strategy" & Genres != "Games" & Genres != "Entertainment") %>%
drop_na(Genres) %>%
group_by(AgeRating,Genres) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
top_n(n=5) %>%
#summarise(averageNumberOfLanguages = mean(numberOfLanguages)) %>%
ggplot(aes(x=AgeRating, y=count, fill=Genres)) +
geom_col(position = position_dodge(), width=0.9) +
#geom_text(aes(label=averageNumberOfLanguages), vjust=-0.25, size=3.5) +
scale_fill_brewer(palette="Set1") +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
xlab("Age Rating") +
ylab("Number of Games") +
theme_minimal()
## Selecting by count
appstoreGamesDF %>%
select(ID,AgeRating, Languages) %>%
separate_rows(Languages, sep=",", convert = TRUE) %>%
mutate(Languages = trimws(Languages)) %>%
# filter(Genres != "Strategy" & Genres != "Games" & Genres != "Entertainment") %>%
drop_na(Languages) %>%
group_by(AgeRating,Languages) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
top_n(n=7) %>%
ggplot(aes(x=AgeRating, y=count, fill=Languages)) +
geom_col(position = position_dodge(), width=0.9) +
geom_text(aes(label=Languages), vjust=-0.25, size=3.5, position = position_dodge(0.9)) +
scale_fill_brewer(palette="Set1") +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
theme_minimal() +
xlab("Age Rating") +
ylab("Number of Games") +
theme(legend.position = "none")
## Selecting by count
We know that English is clearly the most popular language, followed by Chinese (ZH). Since it’s not possible to see the difference between the other language columns, let's create the same plot without English and Chinese.
appstoreGamesDF %>%
select(ID,AgeRating, Languages) %>%
separate_rows(Languages, sep=",", convert = TRUE) %>%
mutate(Languages = trimws(Languages)) %>%
filter(Languages != "EN" & Languages != "ZH") %>%
drop_na(Languages) %>%
group_by(AgeRating,Languages) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
top_n(n=7) %>%
ggplot(aes(x=AgeRating, y=count, fill=Languages)) +
geom_col(position = position_dodge(), width=0.9) +
geom_text(aes(label=Languages), vjust=-0.25, size=3.5, position = position_dodge(0.9)) +
scale_fill_brewer(palette="Set1") +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
theme_minimal() +
xlab("Age Rating") +
ylab("Number of Games") +
theme(legend.position = "none")
## Selecting by count
Unfortunately, there is no information regarding the revenue these games make. We can only speculate that any user who reviews a non-free game has bought it at least once. Thus, we can build a crude model of how much money a game has made compared to others. Of course, this does not consider games with in-app purchases, which are not only the most common type of game on the App Store but also, according to the news, usually the games that make the most money in mobile gaming.
With this crude model, we can relate how most variables impact the revenue of a game: e.g., the number of languages, a specific language, the genres, the release date, the age rating, the app size, and maybe others.
Unfortunately, it is quite difficult to visualize the relationship between age rating and the number of supported languages. The quartile values are very close to one another, so the boxplots are not really useful. The distribution of languages is better shown by plotting a point for each game with some jitter. It was also necessary to crop the y-scale, as some games support more than 90 languages.
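The crude revenue model described above can be sketched as price multiplied by the number of user ratings, restricted to paid games. This is only an illustration under that stated assumption: the data frame below is made up, and `EstimatedMinRevenue` is a hypothetical column name that does not exist in the dataset.

```r
# Hypothetical games; column names mirror the dataset, values are made up.
games <- data.frame(
  Name = c("Game A", "Game B", "Game C"),
  Price = c(0, 2.99, 4.99),
  UserRatingCount = c(120000, 500, 40)
)

# Keep paid games only, then assume each rating corresponds to one purchase.
paid <- games[games$Price > 0, ]
paid$EstimatedMinRevenue <- paid$Price * paid$UserRatingCount

# Sort from highest to lowest estimated revenue.
paid <- paid[order(-paid$EstimatedMinRevenue), ]
print(paid)
```

The result is best read as a lower bound for ranking paid games against each other, since many buyers never leave a rating and in-app purchases are ignored.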
# appstoreGamesDF %>%
# select(ID,AgeRating, Languages) %>%
# separate_rows(Languages, sep=",") %>%
# drop_na(Languages) %>%
# group_by(ID,AgeRating) %>%
# summarise(numberOfLanguages = n())
# arrange(desc(numberOfLanguages))
appstoreGamesDF %>%
select(ID,AgeRating, Languages) %>%
separate_rows(Languages, sep=",") %>%
drop_na(Languages) %>%
group_by(ID,AgeRating) %>%
summarise(numberOfLanguages = n()) %>%
#ungroup %>%
#group_by(AgeRating) %>%
#summarise(averageNumberOfLanguages = mean(numberOfLanguages)) %>%
ggplot(aes(x=AgeRating, y=numberOfLanguages)) +
geom_boxplot() +
geom_jitter(width = 0.3) +
#geom_text(aes(label=averageNumberOfLanguages), vjust=-0.25, size=3.5) +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
coord_cartesian(ylim=c(0,90)) +
xlab("Age Rating") +
ylab("Number of Supported Languages") +
theme_minimal()
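The per-game language count used in this plot comes from splitting the language string and counting the pieces. A minimal base R sketch of the same idea, on made-up strings, is shown below. Unlike the pipeline above, which drops missing values with `drop_na()`, this version keeps them as `NA`.

```r
# Hypothetical language strings, one per game; NA means no language listed.
languages <- c("EN", "EN, ZH, JA", NA, "DE, FR")

# Count the comma-separated entries per game, keeping NA for missing values.
n_languages <- ifelse(is.na(languages), NA_integer_,
                      lengths(strsplit(languages, ",")))
print(n_languages)
```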
Similarly to the previous plot, the best way I found to visualize this question is by plotting a point for each game. In this case, it seems that age rating has no influence on the number of genres. The smaller number of points in the higher age ratings probably just reflects the number of games in each rating.
appstoreGamesDF %>%
select(ID,AgeRating, Genres) %>%
separate_rows(Genres, sep=",") %>%
drop_na(Genres) %>%
group_by(ID,AgeRating) %>%
summarise(numberOfGenres = n()) %>%
#summarise(averageNumberOfLanguages = mean(numberOfLanguages)) %>%
ggplot(aes(x=AgeRating, y=numberOfGenres)) +
geom_boxplot() +
geom_jitter(width = 0.3) +
#geom_text(aes(label=averageNumberOfLanguages), vjust=-0.25, size=3.5) +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
xlab("Age Rating") +
ylab("Total Number of Genres") +
theme_minimal()
To compare the user ratings across age rating categories, I counted the number of games at each average user rating level within each age rating and then calculated the ratio of that count to the total number of rated games in the same age rating. This is displayed in the stacked bar chart below.
appstoreGamesDF %>%
drop_na(AverageUserRating) %>%
arrange(AverageUserRating) %>%
pull(AverageUserRating) %>%
unique() -> AverageUserRatingLevels #Get a vector containing all possible user rating levels in sequential order.
appstoreGamesDF %>%
select(ID,AgeRating, AverageUserRating) %>%
drop_na(AverageUserRating) %>%
mutate(AverageUserRating = factor(AverageUserRating, levels = AverageUserRatingLevels)) %>%
group_by(AgeRating, AverageUserRating) %>%
summarise(count = n()) %>%
mutate(freq = count / sum(count)) %>%
ggplot(aes(x=reorder(AgeRating,desc(AgeRating)), y=freq, fill=AverageUserRating)) +
geom_col(position = position_stack(reverse = TRUE)) +
scale_fill_brewer(palette = "RdYlGn") +
geom_text(aes(label=count), size=4 ,position=position_stack(vjust = .5, reverse = TRUE)) +
theme_minimal() +
xlab("Age Rating") +
ylab("Proportion") +
labs(fill="Average\nUser Rating") +
coord_flip()
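The within-group proportions behind this chart can also be computed with a cross-tabulation. The sketch below uses base R on made-up rows, not the dataset itself, and normalizes each age rating row so that it sums to 1.

```r
# Hypothetical games with an age rating and an average user rating.
games <- data.frame(
  AgeRating = c("4+", "4+", "4+", "4+", "17+", "17+"),
  AverageUserRating = c(4.5, 4.5, 4.0, 3.0, 4.5, 2.0)
)

# Cross-tabulate, then divide each row by its total (margin = 1).
counts <- table(games$AgeRating, games$AverageUserRating)
shares <- prop.table(counts, margin = 1)

print(round(shares, 2))
```

Normalizing within each age rating is what makes the bars comparable even though the 4+ category contains far more games than 17+.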
# Linear Regressions
There aren’t many variables suitable for a linear regression.
Any unordered data is unsuitable for linear regressions. In-app Purchases is also not a good variable to analyze, since it is a list of prices that can have any number of elements and can contain repeated values.
The most likely candidates for a suitable linear regression are Price, User Rating Count, Average User Rating and Size. Starting with the first two:
A significant linear regression between User Rating Count and Price was not found (p = 0.332).
reg <- lm(data=appstoreGamesDF %>% drop_na(Price, UserRatingCount), UserRatingCount~Price)
par(mfrow=c(2,2))
plot(reg)
par(mfrow=c(1,1))
summary(reg)
##
## Call:
## lm(formula = UserRatingCount ~ Price, data = appstoreGamesDF %>%
## drop_na(Price, UserRatingCount))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3413 -3403 -3341 -2780 3029316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3418.1 500.2 6.834 0.00000000000889 ***
## Price -195.3 201.5 -0.969 0.332
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42320 on 7559 degrees of freedom
## Multiple R-squared: 0.0001243, Adjusted R-squared: -7.97e-06
## F-statistic: 0.9397 on 1 and 7559 DF, p-value: 0.3324
appstoreGamesDF %>%
drop_na(Price, UserRatingCount) %>%
select(Price, UserRatingCount) %>%
ggplot(aes(x=Price, y=UserRatingCount))+
geom_point() +
geom_smooth(method="lm") +
coord_cartesian(ylim = c(-25000,100000)) +
theme_minimal()
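The numbers quoted from a regression summary can also be extracted programmatically instead of being read off the printed table. The sketch below fits a model on synthetic data with no built-in relationship. The data frame `toy` and its columns are assumptions for illustration only.

```r
# Synthetic data with no built-in relationship between x and y.
set.seed(42)
toy <- data.frame(x = runif(200, 0, 10), y = rnorm(200, mean = 100, sd = 30))

fit <- lm(y ~ x, data = toy)
coefs <- summary(fit)$coefficients

# Row "x" holds the slope; the columns match the printed summary table.
slope <- coefs["x", "Estimate"]
p_value <- coefs["x", "Pr(>|t|)"]
r_squared <- summary(fit)$r.squared

cat("slope:", slope, " p-value:", p_value, " R-squared:", r_squared, "\n")
```

The same indexing works on the `reg` objects fitted in this section, with the predictor name in place of `"x"`.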
A significant linear regression between Average User Rating and Price was not found (p = 0.971).
reg <- lm(data=appstoreGamesDF %>% drop_na(Price, AverageUserRating), Price~AverageUserRating)
par(mfrow=c(2,2))
plot(reg)
par(mfrow=c(1,1))
summary(reg)
##
## Call:
## lm(formula = Price ~ AverageUserRating, data = appstoreGamesDF %>%
## drop_na(Price, AverageUserRating))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.575 -0.571 -0.571 -0.570 139.419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.576715 0.152703 3.777 0.00016 ***
## AverageUserRating -0.001332 0.036976 -0.036 0.97126
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.416 on 7559 degrees of freedom
## Multiple R-squared: 1.717e-07, Adjusted R-squared: -0.0001321
## F-statistic: 0.001298 on 1 and 7559 DF, p-value: 0.9713
appstoreGamesDF %>%
drop_na(Price, AverageUserRating) %>%
select(Price, AverageUserRating) %>%
ggplot(aes(x=AverageUserRating, y=Price))+
geom_jitter(width=0.15) +
geom_smooth(method="lm") +
coord_cartesian(ylim = c(0,60)) +
theme_minimal()
There is a significant linear regression between the price of a game and its size in bytes (p < 0.001). This seems plausible: bigger games may have a higher price due to the effort spent to create all that data. However, most games are free and earn their revenue through in-app purchases. Also, paid games usually have standardized pricing. As a result, the regression has a very small slope, and the model explains almost none of the variance in price (R-squared of about 0.001).
reg <- lm(data=appstoreGamesDF %>% drop_na(Price, Size), Price~Size)
par(mfrow=c(2,2))
plot(reg)
par(mfrow=c(1,1))
summary(reg)
##
## Call:
## lm(formula = Price ~ Size, data = appstoreGamesDF %>% drop_na(Price,
## Size))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.483 -0.807 -0.715 -0.680 179.221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6639718843973 0.0691516404318 9.602 < 0.0000000000000002 ***
## Size 0.0000000012965 0.0000000002968 4.368 0.0000126 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.832 on 16981 degrees of freedom
## Multiple R-squared: 0.001122, Adjusted R-squared: 0.001064
## F-statistic: 19.08 on 1 and 16981 DF, p-value: 0.0000126
appstoreGamesDF %>%
drop_na(Price, Size) %>%
mutate(Size = Size/1000000) %>%
select(Price, Size) %>%
ggplot(aes(x=Size, y=Price))+
geom_point() +
geom_smooth(method="lm") +
xlab("Size (MB)") +
theme_minimal()
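To get a feel for how small the slope is, it helps to convert it from price per byte to price per megabyte and per gigabyte. The short calculation below uses the Size estimate from the summary output above. The variable names are hypothetical, and the price unit is whatever currency the dataset uses.

```r
# Slope from the summary output above, in price units per byte.
slope_per_byte <- 0.0000000012965

# Convert to price units per megabyte (10^6 bytes, as in the plot) and per gigabyte.
slope_per_mb <- slope_per_byte * 1e6
slope_per_gb <- slope_per_byte * 1e9

print(slope_per_mb)
print(slope_per_gb)
```

In other words, the fitted line predicts a price increase of roughly 1.30 for every additional gigabyte of app size.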
There were other questions that I thought could have been analyzed or plotted:
Is there a correlation between age rating and user rating count?
Is there a correlation between price and the presence of In-app Purchases?
Is there a correlation between original release date and current version release date?